1.  INTRODUCTION

The maximum likelihood principle is a basic and useful technique in
statistics. It has a long history, and a considerable literature treats its
asymptotic properties, e.g., Wald (1949) and LeCam (1953). These classical
results assume that the unknown density function lies in a specified
parametric family. If this assumption fails, do similar results remain
valid? Cox (1961, 1962) first considered this problem in the testing of
separate families (see also Berk (1966, 1970)). Huber (1967) pointed out
that the problem is connected with robust estimation. White (1982) reviewed
the problem and showed consistency and asymptotic normality under
assumptions corresponding to the regularity conditions of the classical
theory. Additional related references are Akaike (1973) and Foutz and
Srivastava (1977).

In Section 2 we give the consistency order of the maximum likelihood
estimator and of the maximized log-likelihood under the usual conditions,
with additional assumptions on higher order derivatives of the specified
densities. We also treat the problem of testing between two families.
Section 3 is concerned with model selection. We prove the strong
consistency of BIC type criteria in a very general setting, and we also
show the inconsistency of AIC. In Section 4, however, we reconsider the
role of consistency in model selection. All proofs of the theorems are
given in Section 5.




2.  OBSERVATIONS  AND  FAMILY  OF  DENSITIES 

Let $n$ observations (which may be multivariate) $x_1, \ldots, x_n$
$(\in \mathbb{R}^d)$ be independently and identically distributed with
probability density function $g$ with respect to a fixed measure $\nu$ on
$\mathbb{R}^d$. Suppose that $\int |\log g(x)|\, g(x)\, d\nu(x) < \infty$.
Next consider the family of densities

$$ M = \{ f(x|\theta) \mid \theta \in \Theta \} \tag{2.1} $$

where $\Theta$ is a convex set in $\mathbb{R}^p$. Define the
quasi-log-likelihood of $n$ observations as

$$ L_n(\theta) = n^{-1} \sum_{i=1}^{n} \ell(x_i|\theta), \qquad \ell(x|\theta) = \log f(x|\theta) \tag{2.2} $$

and denote the quasi-maximum likelihood estimate by
$\hat\theta = \hat\theta_n$. Recall that the Kullback-Leibler information

$$ I(g; f, \theta) = \int g(x) \log\{ g(x)/f(x|\theta) \}\, d\nu(x) \ge 0 \tag{2.3} $$

measures how close $f(\cdot|\theta)$ is to $g$. We call $\theta_g$ and
$f(\cdot|\theta_g)$ the quasi-true parameter and the quasi-true density in
$M$, respectively, when $\theta_g$ minimizes $I(g; f, \theta)$ over
$\theta \in \Theta$, or equivalently when $\theta_g$ maximizes the expected
log-likelihood

$$ \varepsilon(\theta) = \int g(x) \log f(x|\theta)\, d\nu(x). \tag{2.4} $$

Obviously, if $g(x)$ is exactly specified by $M$ as $f(x|\theta_0)$, then
$\theta_g = \theta_0$.
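
The equivalence between minimizing (2.3) and maximizing (2.4) can be made explicit in one line. The following is a short worked step, writing $\varepsilon(\theta)$ for the expected log-likelihood of (2.4); it uses only the integrability assumption on $\log g$ stated above.

```latex
\begin{aligned}
I(g; f, \theta)
  &= \int g(x) \log g(x)\, d\nu(x) - \int g(x) \log f(x|\theta)\, d\nu(x) \\
  &= \underbrace{\int g(x) \log g(x)\, d\nu(x)}_{\text{finite, free of } \theta}
     \; - \; \varepsilon(\theta),
\end{aligned}
\qquad\text{so}\qquad
\arg\min_{\theta \in \Theta} I(g; f, \theta)
  = \arg\max_{\theta \in \Theta} \varepsilon(\theta).
```

The first term is finite because $\int |\log g|\, g\, d\nu < \infty$, and it does not involve $\theta$, so the two optimization problems share the same solution $\theta_g$.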

Now we make assumptions on $(g, M)$ which will enable us to study the
asymptotic behavior of the maximum likelihood principle.

ASSUMPTION A1. The quasi-true parameter $\theta_g$ is unique and is an
interior point of $\Theta$.


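ASSUMPTION A2. (a) $\ell_\alpha(x|\theta) = \partial \ell(x|\theta)/\partial\theta_\alpha$
and $\ell_{\alpha\beta}(x|\theta) = \partial^2 \ell(x|\theta)/\partial\theta_\alpha \partial\theta_\beta$
$(\alpha, \beta = 1, \ldots, p)$ are measurable with respect to
$x \in \mathbb{R}^d$ for each $\theta \in \Theta$ and continuous with
respect to $\theta$ for each $x$, where $\ell(x|\theta) = \log f(x|\theta)$.

(b) $|\ell(x|\theta)|$, $|\ell_\alpha(x|\theta)|$, $|\ell_{\alpha\beta}(x|\theta)|$
and $|\ell_\alpha(x|\theta)\, \ell_\beta(x|\theta)|$ are dominated by functions
that are integrable with respect to $g(x)$ and do not depend on $\theta$.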

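ASSUMPTION A3. $V(\theta_g)$ and $W(\theta_g)$ are positive definite, where

$$ V(\theta) = E_g\left[ \frac{\partial}{\partial\theta} \ell(x|\theta) \, \frac{\partial}{\partial\theta^T} \ell(x|\theta) \right] \quad\text{and}\quad W(\theta) = -E_g\left[ \frac{\partial^2}{\partial\theta\, \partial\theta^T} \ell(x|\theta) \right]. $$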

ASSUMPTION A4. There exists a quasi-maximum likelihood estimate
$\hat\theta_n$ which tends to $\theta_g$ with probability 1.

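ASSUMPTION A5. (a) $\ell_{\alpha\beta\gamma}(x|\theta) = \partial^3 \ell(x|\theta)/\partial\theta_\alpha \partial\theta_\beta \partial\theta_\gamma$
$(\alpha, \beta, \gamma = 1, \ldots, p)$ are measurable with respect to $x$
for each $\theta$.

(b) $|\ell(x|\theta)|^2$, $|\ell_\alpha(x|\theta)|^2$ and
$|\ell_{\alpha\beta}(x|\theta)|^2$ are dominated by functions that are
integrable with respect to $g$ and do not depend on $\theta$.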

Remark on A4. (i) Case $g \in M$: Several sufficient conditions ensuring
assumption A4 are known, e.g., Wald (1949), Huber (1967) and 5e.2 of
Rao (1973).

(ii) Case $g \notin M$: White (1982) showed that A1-A3 together with
A4': "$\Theta$ is compact" ensure A4. The conditions of Huber, derived
without assuming that $g$ is exactly specified, suffice for A4. Wald's
assumptions can also be adapted to this situation by replacing
$dF(x, \theta_0)$ with $g(x)\, d\nu$ and $\theta_0$ with $\theta_g$; the
modified assumptions then imply A4.


If the true density is completely unknown, none of our conditions can be
checked. However, if $M$ gives a good approximation to $g$, and $M$ meets
conditions A1-A5 when $g(x) = f(x|\theta_0)$, then $(g, M)$ will satisfy
A1-A5.

Assumptions A1-A4 correspond to the regularity conditions of the classical
theory. They ensure the strong consistency of $\hat\theta_n$ and
$L_n(\hat\theta_n)$. Further, the asymptotic normality of $\hat\theta_n$
can be shown; see, e.g., White (1982) and Foutz and Srivastava (1977). If
we additionally assume A5, the consistency order can be evaluated as in the
following theorem, which will play a key role in studying model selection
criteria.

THEOREM 1. Let $n$ independent observations come from the distribution
with density $g$, and let $(g, M)$ meet A1-A5, where $M$ is defined in
(2.1). The orders relating to the quasi-maximum likelihood estimate
$\hat\theta_n$ and the log-likelihood are:

(i) $\hat\theta_n = \theta_g + O((n^{-1} \log\log n)^{1/2})$ a.s.

(ii) $L_n(\hat\theta_n) = L_n(\theta_g) + O(n^{-1} \log\log n)$ a.s.

(iii) $L_n(\hat\theta_n) = \varepsilon(\theta_g) + O((n^{-1} \log\log n)^{1/2})$ a.s.

where $\theta_g$ is defined in A1, $L_n(\theta)$ in (2.2) and
$\varepsilon(\theta)$ in (2.4).
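
The rate in part (i) of Theorem 1 can be illustrated numerically. The sketch below is only an illustration under stated assumptions, not part of the original argument: it takes $g$ to be the mixture $\frac12 N(1,1) + \frac12 N(-1,1)$ and $M$ the normal family $N(\mu, \sigma^2)$, for which the quasi-maximum likelihood estimates are the sample mean and the (divisor $n$) sample variance, and the quasi-true values are $\mu = 0$, $\sigma^2 = 2$. The function names are hypothetical. It reports the estimation errors divided by $(n^{-1}\log\log n)^{1/2}$, which the theorem says stay bounded almost surely; a single simulated path can suggest this but cannot prove it.

```python
import math
import random


def sample_mixture(rng):
    """Draw one observation from g = 0.5*N(1,1) + 0.5*N(-1,1) (assumed g)."""
    center = 1.0 if rng.random() < 0.5 else -1.0
    return rng.gauss(center, 1.0)


def scaled_errors(n_values, seed=0):
    """Return {n: (|mean - 0| / rate, |var - 2| / rate)} along one sample path.

    rate = sqrt(log(log(n)) / n); every n must be >= 3 so that log(log(n)) > 0.
    """
    rng = random.Random(seed)
    wanted = set(n_values)
    results = {}
    total = 0.0
    total_sq = 0.0
    for n in range(1, max(n_values) + 1):
        x = sample_mixture(rng)
        total += x
        total_sq += x * x
        if n in wanted:
            mean = total / n                 # QMLE of mu in the normal family
            var = total_sq / n - mean * mean  # QMLE of sigma^2 (divisor n)
            rate = math.sqrt(math.log(math.log(n)) / n)
            results[n] = (abs(mean - 0.0) / rate, abs(var - 2.0) / rate)
    return results


if __name__ == "__main__":
    for n, (err_mu, err_var) in sorted(scaled_errors([1000, 10000, 50000]).items()):
        print(n, round(err_mu, 3), round(err_var, 3))
```

The printed ratios should remain of moderate size as $n$ grows, in line with (i); they do not converge to a limit, which is the expected behavior under the law of the iterated logarithm.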

Note that Theorem 1 is new even if $g$ is exactly specified by $M$. In
non-regular cases the consistency order of $\hat\theta_n$ may differ from
$O((n^{-1} \log\log n)^{1/2})$. However, (ii) still remains valid as long
as the consistency order of $\hat\theta_n$ is faster than
$O((n^{-1} \log\log n)^{1/2})$, because the order of (ii) is based on the
law of the iterated logarithm for
$\ell(x_1|\theta_g) + \cdots + \ell(x_n|\theta_g)$.

Cox (1961, 1962) introduced the problem: Which family specifies the true
density? He proposed the corrected likelihood ratio test. Our problem is:
Which family is closer to the true density? We take a simple likelihood
ratio approach. Let $M_i = \{ f_i(x|\theta_i) \mid \theta_i \in \Theta_i \}$
$(i = 0, 1)$ be families of densities (which may not be separate), and let
$\varepsilon_i$ be the maximized expected log-likelihood in $M_i$ (see
(2.4)). Then test the hypothesis

$$ H_0: \varepsilon_0 = \varepsilon_1 \quad\text{versus}\quad H_1: \varepsilon_0 > \varepsilon_1. \tag{2.5} $$

Assume both $(g, M_i)$ satisfy A1-A5. If $H_1$ is true, then by (iii) of
Theorem 1 the likelihood ratio

$$ \lambda_n = \sum_{i=1}^{n} \log\{ f_0(x_i|\hat\theta_0) / f_1(x_i|\hat\theta_1) \} \tag{2.6} $$

tends to infinity, since $n^{-1}\lambda_n \to \varepsilon_0 - \varepsilon_1 > 0$
a.s. This implies that the likelihood ratio can asymptotically find the
family closer to $g$. More precisely, we have:

THEOREM 2. Consider testing the hypothesis (2.5) under the conditions
A1-A5. Then the likelihood ratio test is consistent.

Let $\sigma^2$ be the asymptotic variance of $n^{-1/2}\lambda_n$. If
$d = |\varepsilon_0 - \varepsilon_1|/\sigma$ is large, we can discriminate
between the families with a small sample. However, when $d$ is small we
need a large sample. In such a case it would be preferable to develop a
discussion similar to that of the corrected likelihood ratio proposed by
Cox. See also Kent (1986).
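
The likelihood ratio test can be sketched in a few lines once each family has been fitted. The code below is a minimal illustration, assuming a rejection region of the form $\lambda_n > n^{1/2} \hat\sigma_n z_\alpha$, with $\hat\sigma_n^2$ the average squared log-ratio and $z_\alpha$ the upper $100\alpha$-percent point of the standard normal distribution. The function name and the input convention (per-observation log-densities already evaluated at the fitted parameters) are hypothetical choices made for this example.

```python
import math


def likelihood_ratio_test(loglik0, loglik1, z_alpha=1.645):
    """Simple likelihood ratio test of H0: eps0 = eps1 against H1: eps0 > eps1.

    loglik0[i] = log f0(x_i | theta0_hat), loglik1[i] = log f1(x_i | theta1_hat).
    z_alpha defaults to the upper 5% point of the standard normal distribution.
    Returns (lambda_n, threshold, reject_H0).
    """
    if len(loglik0) != len(loglik1) or not loglik0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(loglik0)
    diffs = [a - b for a, b in zip(loglik0, loglik1)]
    lam = sum(diffs)                                    # lambda_n of (2.6)
    sigma_hat = math.sqrt(sum(d * d for d in diffs) / n)  # estimate of sigma
    threshold = math.sqrt(n) * sigma_hat * z_alpha
    return lam, threshold, lam > threshold


if __name__ == "__main__":
    # Every observation favours family 0 by one log unit: H0 is rejected.
    print(likelihood_ratio_test([0.0] * 4, [-1.0] * 4))
    # Evidence alternates in sign and cancels: H0 is not rejected.
    print(likelihood_ratio_test([1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]))
```

Because $\lambda_n$ grows like $n(\varepsilon_0 - \varepsilon_1)$ under $H_1$ while the threshold grows only like $n^{1/2}$, the rejection probability tends to one, which is the consistency asserted in Theorem 2.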


3.  MODEL  SELECTION 


We have shown that the likelihood ratio test is useful when two models are
under consideration. When there are many candidate models for the true
density $g$, model selection procedures are used. Consider $k$ models
$M_i = \{ f_i(x|\theta_i) \mid \theta_i \in \Theta_i \}$. We treat here
criteria of the following form:

$$ IC(i) = -2n L_n^{(i)}(\hat\theta_i) + c_n p_i, \qquad (i = 1, \ldots, k) \tag{3.1} $$

where $\hat\theta_i$, $L_n^{(i)}(\hat\theta_i)$ and $p_i$ are respectively
the quasi-maximum likelihood estimate, the quasi-maximum log-likelihood
divided by $n$, and the number of parameters under the model $M_i$. The
model minimizing (3.1) is regarded as the best model. Akaike (1973)
proposed $c_n = 2$ (AIC), Schwarz (1978) and Rissanen (1978) proposed
$c_n = \log n$ (BIC), and Hannan & Quinn (1979) proposed
$c_n = K \log\log n$ $(K > 0)$. Suppose the expected log-likelihood of
$M_1$ is the largest among those of the $k$ families. By Theorem 2, $IC(i)$
$(i = 1, \ldots, k)$ will almost surely take its minimum value at $IC(1)$
for large $n$ if $\lim n^{-1} c_n = 0$. Every criterion above satisfies
this condition. Hence we can find asymptotically which model is closest to
$g$. Next we treat the case in which the closest model $M_1$ ($M$, say) is
divided into several subfamilies (the nested case).
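
The criteria of the form (3.1) differ only in the penalty weight $c_n$, which the following sketch makes concrete. The function names are hypothetical, the maximized log-likelihood passed in is $n$ times the normalized quantity in (3.1), and the default $K = 2$ for the Hannan-Quinn weight is an arbitrary illustrative choice (the text only requires $K > 0$).

```python
import math


def penalty(name, n, K=2.0):
    """Penalty weight c_n for the three criteria named in the text."""
    if name == "AIC":
        return 2.0
    if name == "BIC":
        return math.log(n)
    if name == "HQ":
        return K * math.log(math.log(n))  # requires n >= 3
    raise ValueError("unknown criterion: " + name)


def information_criterion(max_loglik, num_params, c_n):
    """IC = -2 * (maximized log-likelihood) + c_n * (number of parameters)."""
    return -2.0 * max_loglik + c_n * num_params


def select_model(max_logliks, num_params, c_n):
    """Return (0-based index of the model with the smallest IC, list of IC values)."""
    scores = [information_criterion(l, p, c_n)
              for l, p in zip(max_logliks, num_params)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    n = 100
    logliks = [-150.0, -148.5]   # made-up maximized log-likelihoods
    params = [1, 2]
    for name in ("AIC", "BIC", "HQ"):
        print(name, select_model(logliks, params, penalty(name, n)))
```

With these made-up numbers the extra parameter buys 1.5 log-likelihood units, enough for AIC (penalty 2 per parameter) to prefer the larger model but not for BIC (penalty $\log 100 \approx 4.6$), which illustrates how the choice of $c_n$ drives the selection.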

Suppose the quasi-true parameter vector $\theta_g$ can be written as

$$ \theta_g = (\theta_1^*, \ldots, \theta_q^*, 0, \ldots, 0), \qquad \theta_1^* \ne 0, \ldots, \theta_q^* \ne 0, $$

and suppose the zero vector is an interior point of $\Theta$. This
assumption implies that $\theta_{q+1}, \ldots, \theta_p$ are redundant. We
call $J^* = \{1, \ldots, q\}$ the quasi-true model and
$J_f = \{1, \ldots, p\}$ the full model for simplicity. Let $J$ be a subset
of $J_f$. Then the submodel of $M$ specified by $J$, say $M(J)$, is defined
by $\{ f(x|\theta(J)) \mid \theta \in \Theta \}$, where

$$ \theta(J) = (0, \ldots, 0, \theta_{j_1}, 0, \ldots, 0, \theta_{j_2}, 0, \ldots, 0, \theta_{j_s}, 0, \ldots, 0), \qquad J = \{ j_1, \ldots, j_s \}. $$


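EXAMPLE. Let $\phi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$,
$g(x) = \tfrac{1}{2}\{ \phi(x - 1) + \phi(x + 1) \}$ and

$$ M = \{ \sigma^{-1} \phi(\sigma^{-1}(x - \mu)) \mid \theta = (\theta_1, \theta_2) = (\sigma^2 - 1, \mu),\ \theta_1 > -1,\ -\infty < \mu < \infty \}. $$

Then $\theta_g = (1, 0)$, $J^* = \{1\}$, $J_f = \{1, 2\}$,
$M(\{2\}) = \{ N(\mu, 1) \}$ and $M(\{1\}) = \{ N(0, \sigma^2) \}$.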
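
The value $\theta_g = (1, 0)$ in the example can be checked numerically. For a normal family, the expected log-likelihood $-\tfrac12 \log(2\pi\sigma^2) - E_g(X - \mu)^2 / (2\sigma^2)$ is maximized at $\mu = E_g X$ and $\sigma^2 = \mathrm{Var}_g X$, so it suffices to compute the first two moments of the mixture $g$. The sketch below does this with a plain trapezoidal rule; the function names, the integration range $[-12, 12]$ and the step count are illustrative assumptions.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def g(x):
    """Mixture density g(x) = (phi(x - 1) + phi(x + 1)) / 2 from the example."""
    return 0.5 * (phi(x - 1.0) + phi(x + 1.0))


def moment(k, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal approximation of the integral of x**k * g(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (lo ** k * g(lo) + hi ** k * g(hi))
    for i in range(1, steps):
        x = lo + i * h
        total += x ** k * g(x)
    return total * h


def quasi_true_parameter():
    """Return theta_g = (sigma^2 - 1, mu) for the normal family under g."""
    mean = moment(1)
    variance = moment(2) - mean * mean
    return (variance - 1.0, mean)


if __name__ == "__main__":
    print(quasi_true_parameter())
```

The mixture has mean 0 and variance $1 + 1 = 2$, so the result agrees with $\theta_g = (\sigma^2 - 1, \mu) = (1, 0)$ up to integration error: the location parameter is redundant and only $\theta_1$ is nonzero, giving $J^* = \{1\}$.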

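Suppose $(g, M(J))$ meet A1-A5, and denote the quasi-true parameter and the
quasi-maximum likelihood estimate by $\theta_{Jg}$ and $\hat\theta_J$
respectively. Then $\varepsilon(\theta_{Jg}) = \varepsilon(\theta_g)$ if
$J \supseteq J^*$, and $\varepsilon(\theta_{Jg}) < \varepsilon(\theta_g)$
if $J \not\supseteq J^*$. Thus by Theorem 2:

THEOREM 3. Let $\lambda_n$ be the likelihood ratio
$L_n(\hat\theta_J) - L_n(\hat\theta_{J^*})$. If $J \supseteq J^*$, then
$\lambda_n \ge 0$ and $\lambda_n = O(n^{-1} \log\log n)$ a.s. If
$J \not\supseteq J^*$, then
$\lambda_n \to \varepsilon(\theta_{Jg}) - \varepsilon(\theta_g) < 0$.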

THEOREM 4. Let $\hat J_n$ be a subset of $J_f$ minimizing $IC(J)$ of (3.1).
If $c_n$ satisfies both

$$ \lim_{n\to\infty} n^{-1} c_n = 0 \quad\text{and}\quad \lim_{n\to\infty} c_n / \log\log n = +\infty, \tag{3.2} $$

then $\hat J_n$ is a strongly consistent estimator of the quasi-true model
$J^*$, i.e.,

$$ \lim_{n\to\infty} \hat J_n = J^*, \quad \text{a.s.} $$

Note that if we relax the latter condition of (3.2) to

$$ \lim_{n\to\infty} n^{-1} c_n = 0 \quad\text{and}\quad \lim_{n\to\infty} c_n = +\infty, \tag{3.3} $$

then $\hat J_n$ is a weakly consistent estimator of $J^*$, i.e.,
$\lim_{n\to\infty} P[\hat J_n = J^*] = 1$.
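
It may help to check the three penalty weights mentioned in this section against the two sets of conditions, (3.2): $n^{-1}c_n \to 0$ and $c_n/\log\log n \to \infty$, and (3.3): $n^{-1}c_n \to 0$ and $c_n \to \infty$. The worked limits below use only elementary calculus. Note that both sets of conditions are sufficient conditions, so failing (3.2) does not by itself show that a criterion lacks strong consistency.

```latex
\begin{array}{llll}
\text{criterion} & c_n & \text{limits} & \text{conditions met} \\ \hline
\text{AIC} & 2
  & n^{-1}c_n \to 0,\quad c_n \to 2 \ (\text{not } \infty)
  & \text{neither (3.2) nor (3.3)} \\[4pt]
\text{BIC} & \log n
  & n^{-1}\log n \to 0,\quad \dfrac{\log n}{\log\log n} \to \infty
  & \text{(3.2), hence also (3.3)} \\[8pt]
\text{Hannan-Quinn} & K \log\log n
  & n^{-1}c_n \to 0,\quad c_n \to \infty,\quad \dfrac{c_n}{\log\log n} = K
  & \text{(3.3) only}
\end{array}
```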

However, computing $\hat J_n$ requires extensive calculation when $p$ is
large, because there are $2^p - 1$ non-empty subsets of $J_f$. The
following alternative procedure saves computation. Let
$J_{-j} = \{1, \ldots, j-1, j+1, \ldots, p\}$ for $j \in J_f$. Define

$$ \tilde J_n = \{ j \in J_f \mid IC(J_{-j}) > IC(J_f) \}. $$

Then, along the same lines as the proof of Theorem 4, we get:


THEOREM 5. If $c_n$ satisfies (3.2) or (3.3), then $\tilde J_n$ is also a
strongly or weakly consistent estimator of $J^*$, respectively.
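
The computational saving of the deletion procedure is easy to see in code: the exhaustive search evaluates the criterion on all $2^p - 1$ non-empty subsets, while the deletion rule needs only $p + 1$ evaluations (the full model and each leave-one-out model). The sketch below assumes the criterion is supplied as a function `ic` mapping a set of indices to its IC value; both function names are hypothetical.

```python
from itertools import combinations


def exhaustive_selection(ic, p):
    """Minimize ic(J) over all 2**p - 1 non-empty subsets J of {1, ..., p}."""
    best, best_val = None, float("inf")
    for size in range(1, p + 1):
        for subset in combinations(range(1, p + 1), size):
            J = frozenset(subset)
            val = ic(J)
            if val < best_val:
                best, best_val = J, val
    return best


def deletion_selection(ic, p):
    """Keep index j when deleting it from the full model increases the criterion.

    Uses only p + 1 evaluations of ic: the full model and each leave-one-out model.
    """
    full = frozenset(range(1, p + 1))
    ic_full = ic(full)
    return frozenset(j for j in full if ic(full - {j}) > ic_full)


if __name__ == "__main__":
    # Toy criterion (made up): indices 1 and 2 matter, 3 and 4 are redundant.
    needed = frozenset({1, 2})

    def toy_ic(J):
        lack_of_fit = 100.0 * len(needed - J) - 0.5 * len(J - needed)
        return lack_of_fit + 3.0 * len(J)   # penalty weight c_n = 3 per index

    print(sorted(exhaustive_selection(toy_ic, 4)))
    print(sorted(deletion_selection(toy_ic, 4)))
```

In the toy criterion, omitting a needed index costs far more than the penalty saved, while a redundant index improves the fit by less than its penalty, so both procedures return the same set.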


AIC is not consistent because $c_n = 2$ meets neither (3.2) nor (3.3). It
will overestimate the quasi-true model. The probability
$\lim_{n\to\infty} P[\hat J_n = J] > 0$ for $J \supset J^*$ can be expressed
using positive linear combinations of independent chi-square variates;
however, the formula is hard to evaluate in a simple form.
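
The simplest special case gives a feel for the size of this overestimation. Assume, for illustration only, that $g$ belongs to $M$ (so the classical likelihood ratio theory applies) and compare just two models: $J^*$ and a model $J$ with one redundant parameter added. Then $2n\{L_n(\hat\theta_J) - L_n(\hat\theta_{J^*})\}$ is asymptotically chi-square with one degree of freedom, and a criterion with weight $c_n$ prefers the larger model when this statistic exceeds $c_n$. This pairwise, correctly specified setting is an assumption of the sketch, not the general situation described above; the function name is hypothetical.

```python
import math


def chi2_1_upper_tail(c):
    """P(chi-square with 1 d.f. > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c / 2))."""
    if c < 0:
        raise ValueError("c must be non-negative")
    return math.erfc(math.sqrt(c / 2.0))


if __name__ == "__main__":
    # AIC: c_n = 2 for every n, so the limiting probability stays positive.
    print("AIC:", round(chi2_1_upper_tail(2.0), 4))
    # BIC: c_n = log n grows, so the probability shrinks towards zero.
    for n in (100, 1000, 100000):
        print("BIC, n =", n, ":", round(chi2_1_upper_tail(math.log(n)), 4))
```

Under these assumptions AIC picks the over-sized model with limiting probability about 0.157 in the pairwise comparison, whereas the corresponding probability for BIC decreases to zero as $n$ grows.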


4.  DISCUSSION 


Our results are based on the i.i.d. assumption. However, Theorems 1-5
remain valid even if the $n$ observations are weakly dependent, provided
the dependence still allows the central limit theorem and the law of the
iterated logarithm to hold. Hence our results are quite general.

Next we reconsider consistency in the model selection problem. From the
point of view that the model is an approximation with finitely many
parameters to a true density with infinitely many parameters (see Shibata
(1980)), the quasi-true model under $M$ becomes the full model in many
cases. Then AIC also becomes consistent, since it does not underestimate
the quasi-true model. Unfortunately, our observations provide no difference
between AIC and BIC in this case.

The purpose of model selection may be to find a model that gives good
predictions for future observations, not a model that provides a good fit
to the given observations. Recall that AIC was proposed as an estimator of
the predictive density. Consistency is one criterion for classifying model
selection procedures, and this criterion may not always lead to a suitable
conclusion in practical situations.


5.  APPENDIX


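Proof of Theorem 1. From A1 and A4, $\hat\theta_n$ exists and is an interior
point of $\Theta$ for large $n$. Employing Taylor's expansion we get

$$ 0 = \partial L_n(\hat\theta_n)/\partial\theta = \partial L_n(\theta_g)/\partial\theta - W_n(\theta_g)(\hat\theta_n - \theta_g) + r_n \tag{5.1} $$

where

$$ W_n(\theta) = -\partial^2 L_n(\theta)/\partial\theta\,\partial\theta^T : p \times p, \qquad r_n = (r_{1n}, \ldots, r_{pn})^T, $$

$$ r_{in} = (\hat\theta_n - \theta_g)^T \left[ \partial^2 \{ \partial L_n(\bar\theta)/\partial\theta_i \} / \partial\theta\,\partial\theta^T \right] (\hat\theta_n - \theta_g), $$

$$ \bar\theta = \theta_g + \delta(\hat\theta_n - \theta_g), \qquad 0 < \delta < 1. $$

By the law of the iterated logarithm and A3, A5, we have

$$ \partial L_n(\theta_g)/\partial\theta = O((n^{-1} \log\log n)^{1/2}), \quad \text{a.s.} $$

$$ W_n(\theta_g) = W(\theta_g) + O((n^{-1} \log\log n)^{1/2}), \quad \text{a.s.} \tag{5.2} $$

because $E\, \partial L_n(\theta_g)/\partial\theta = \partial E_g \ell(X|\theta_g)/\partial\theta = 0$.
From (5.2) and A3, $W_n(\theta_g)$ is positive definite when $n$ is large.
Solving (5.1),

$$ \hat\theta_n - \theta_g = W_n(\theta_g)^{-1} \left( \partial L_n(\theta_g)/\partial\theta + r_n \right). $$

By A5 there exist a function $H$, integrable with respect to $g$, and
$K > 0$ such that for any $\alpha, \beta, \gamma = 1, \ldots, p$

$$ | \partial^3 L_n(\theta)/\partial\theta_\alpha \partial\theta_\beta \partial\theta_\gamma | \le n^{-1} \sum_{i=1}^{n} H(X_i) < K, $$

which implies $r_n = O(1)(\hat\theta_n - \theta_g)$ a.s.; more precisely,
$r_n$ is quadratic in $\hat\theta_n - \theta_g$, so by A4 it is of smaller
order than $\hat\theta_n - \theta_g$. Thus

$$ \hat\theta_n - \theta_g = O((n^{-1} \log\log n)^{1/2}), \quad \text{a.s.} $$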
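Again by the law of the iterated logarithm we know

$$ L_n(\theta_g) = \mu_g + O((n^{-1} \log\log n)^{1/2}), \quad \text{a.s.} $$

where $\mu_g = \varepsilon(\theta_g)$ is defined in (2.4). From (i) and A2,

$$ L_n(\hat\theta_n) - L_n(\theta_g) = (\hat\theta_n - \theta_g)^T \partial L_n(\theta_g)/\partial\theta + \tfrac{1}{2} (\hat\theta_n - \theta_g)^T \{ \partial^2 L_n(\bar\theta)/\partial\theta\,\partial\theta^T \} (\hat\theta_n - \theta_g) = O(n^{-1} \log\log n), \quad \text{a.s.} $$

Hence $L_n(\hat\theta_n) = L_n(\theta_g) + \{ L_n(\hat\theta_n) - L_n(\theta_g) \} = \mu_g + O((n^{-1} \log\log n)^{1/2})$, a.s.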
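
The order bookkeeping in this last step can be spelled out. The following worked step writes $a_n = (n^{-1}\log\log n)^{1/2}$, uses $\hat\theta_n$ for the quasi-maximum likelihood estimate and $\bar\theta$ for the intermediate point of the Taylor expansion, and assumes (via the domination condition in A2 and the strong law of large numbers) that the second-derivative matrix of $L_n$ is $O(1)$ almost surely.

```latex
\begin{aligned}
(\hat\theta_n - \theta_g)^T \,\frac{\partial L_n(\theta_g)}{\partial\theta}
  &= O(a_n)\cdot O(a_n) = O(a_n^2), \\
\tfrac12 (\hat\theta_n - \theta_g)^T
  \frac{\partial^2 L_n(\bar\theta)}{\partial\theta\,\partial\theta^T}
  (\hat\theta_n - \theta_g)
  &= O(a_n)\cdot O(1)\cdot O(a_n) = O(a_n^2), \\
\text{so}\quad L_n(\hat\theta_n) - L_n(\theta_g)
  &= O(a_n^2) = O(n^{-1}\log\log n), \quad \text{a.s.}
\end{aligned}
```

Since $a_n^2 = o(a_n)$, this difference is negligible next to the $O(a_n)$ fluctuation of $L_n(\theta_g)$ around its mean, which is why part (iii) of Theorem 1 has the slower rate $a_n$.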


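Proof of Theorem 2. The asymptotic normality of the likelihood ratio
$\lambda_n$ of (2.6) is known from Foutz and Srivastava (1977):

$$ n^{-1/2} \{ \lambda_n - n(\varepsilon_0 - \varepsilon_1) \} \to N[0, \sigma^2] \quad \text{in distribution}, $$

where $\sigma^2 = E_g[ \{ \log( f_0(X|\theta_{0g}) / f_1(X|\theta_{1g}) ) \}^2 ]$
and $\theta_{ig}$ $(i = 0, 1)$ are the quasi-true parameters. Using the
consistent estimator of $\sigma^2$

$$ \hat\sigma_n^2 = n^{-1} \sum_{i=1}^{n} [ \log\{ f_0(x_i|\hat\theta_0) / f_1(x_i|\hat\theta_1) \} ]^2, $$

we define the rejection region of $H_0$ by

$$ R_n = \{ \lambda_n > n^{1/2} \hat\sigma_n z_\alpha \} $$

where $z_\alpha$ is the upper $100\alpha$-percent point of the standard
normal distribution. Under $H_1$, or equivalently
$\mu = \varepsilon_0 - \varepsilon_1 > 0$,

$$ P[R_n | H_1] = P[ n^{-1/2}(\lambda_n - n\mu) > \hat\sigma_n z_\alpha - n^{1/2}\mu \mid H_1 ] \to 1 \quad (n \to \infty) $$

because $n^{-1/2}(\lambda_n - n\mu) \to N[0, \sigma^2]$ in distribution and
$\hat\sigma_n z_\alpha - n^{1/2}\mu \to -\infty$ in probability.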

Proof of Theorem 4. If $J \supsetneq J^*$, then

$$ IC(J) - IC(J^*) = (\#J - q) c_n - 2n \{ L_n(\hat\theta_J) - L_n(\hat\theta_{J^*}) \} $$

$$ = \log\log n \left[ (\#J - q) c_n / \log\log n - 2n (\log\log n)^{-1} \{ L_n(\hat\theta_J) - L_n(\hat\theta_{J^*}) \} \right] \to +\infty, \quad \text{a.s. (Theorem 2)}, $$

since $\#J - q > 0$ and $\lim_{n\to\infty} c_n / \log\log n = +\infty$.
This implies that $IC(J) > IC(J^*)$ a.s. for large $n$. Hence
$\hat J_n \subseteq J^*$.

If $J \not\supseteq J^*$,

$$ IC(J) - IC(J^*) = 2n \left[ L_n(\hat\theta_{J^*}) - L_n(\hat\theta_J) + (\#J - q) c_n / (2n) \right] \to +\infty, \quad \text{a.s.} $$

since $L_n(\hat\theta_{J^*}) - L_n(\hat\theta_J) \to c > 0$ and
$\lim_{n\to\infty} n^{-1} c_n = 0$. Hence $\hat J_n \supseteq J^*$ for
large $n$.